Add support for sourcing chain data from Electrum #486
Conversation
Force-pushed from b386ee7 to 9352b2e.
Force-pushed from 568d10d to 678cf91.
I'm familiarizing myself with the codebase, but I tried running a node locally with the changes here pointing to an Electrum server, and basic operations are working. I just have a couple of questions, if I may, to get a better understanding.

Not strictly related to the PR, but maybe I can sneak it in: why is it that Bitcoin Core RPC and Electrum live in their separate `bitcoind_rpc.rs` and `electrum.rs` files, but for Esplora it is all in `mod.rs`? I see that the `EsploraAsyncClient` is used directly without a wrapper, but wanted to see what the reasons for it were (:
Hmm, for one it has historic reasons, but also ...

Right, that's essentially the main reason why it still lives in `mod.rs`.
Force-pushed from 9608e66 to 3142c41.
Force-pushed from 466ab9f to e39b1f2.
Blocked on rust-bitcoin/corepc#111 for now.
Force-pushed from 0d401fe to dda6bc7.
Force-pushed from dda6bc7 to edb4b10.
Rebased on `main`.
The question that I asked myself most during review is whether the way this PR deals with the runtime and start/stop is really the best option for now.
Force-pushed from edb4b10 to 97f4bd9.
Force-pushed from 2e12bf1 to be2496e.
Addressed pending feedback and rebased on `main` to resolve a minor conflict.
Force-pushed from be2496e to 7cb0d1f.
Still not all the way through, but posting this now.

The main structural comment is the many instances of matching on the backend, and whether those can be converted into separate objects: more OO-style and more readable.
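For illustration, here is a rough sketch of the kind of restructuring meant here. All names are hypothetical and not taken from the PR; it only shows the general trait-object pattern instead of a per-call-site `match`:

```rust
// Hypothetical trait unifying the chain backends. Each backend implements it
// once, so call sites dispatch through the trait object instead of repeating
// a `match` on a backend enum for every operation.
trait ChainBackend {
	fn backend_name(&self) -> &'static str;
	fn best_block_height(&self) -> Result<u32, ()>;
}

struct ElectrumBackend {/* wraps the Electrum client */}
struct EsploraBackend {/* wraps the Esplora client */}

impl ChainBackend for ElectrumBackend {
	fn backend_name(&self) -> &'static str {
		"electrum"
	}
	fn best_block_height(&self) -> Result<u32, ()> {
		unimplemented!("query the Electrum server")
	}
}

impl ChainBackend for EsploraBackend {
	fn backend_name(&self) -> &'static str {
		"esplora"
	}
	fn best_block_height(&self) -> Result<u32, ()> {
		unimplemented!("query the Esplora server")
	}
}

// Callers no longer need a per-backend `match`:
fn log_tip(backend: &dyn ChainBackend) {
	if let Ok(height) = backend.best_block_height() {
		println!("{} tip height: {}", backend.backend_name(), height);
	}
}
```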
let spawn_fut = self.runtime.spawn_blocking(move || electrum_client.batch_call(&batch));
let timeout_fut = tokio::time::timeout(
Is it really necessary to have these timeouts here, and is there nothing in the electrum client? It surprises me because apparently retries are implemented in the client code, but not timeouts?
There is a timeout parameter, but we previously learned on the Esplora side not to necessarily trust the timeout parameters of the libraries we use (namely, `reqwest`, which is used by `esplora-client`, showed delayed handling of concurrent requests if the futures were polled from different threads). So we introduce reliable timeouts on top of the library-provided ones as a belt-and-suspenders measure to make sure we never get into a state where a misbehaving task can lock up all of our worker threads. I guess we could consider enabling the timeouts on `electrum-client` in addition to the ones we set, but I'm not sure if coming up with yet another magic number is worth it?
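For reference, a minimal sketch of the double-timeout pattern discussed here, assuming `batch_call` returns a `Result<Vec<serde_json::Value>, _>` as in current `electrum-client` versions; the helper name, the 5-second value, and the error handling are made up for illustration:

```rust
use std::sync::Arc;
use std::time::Duration;

use electrum_client::ElectrumApi;

async fn batch_call_with_timeout(
	runtime: Arc<tokio::runtime::Runtime>, electrum_client: Arc<electrum_client::Client>,
	batch: electrum_client::Batch,
) -> Result<Vec<serde_json::Value>, ()> {
	// Run the blocking Electrum call on Tokio's blocking thread pool.
	let spawn_fut = runtime.spawn_blocking(move || electrum_client.batch_call(&batch));
	// Bound how long the async caller waits for it. Note that hitting this
	// timeout only stops us from waiting; the blocking thread keeps running
	// until the underlying call returns (or hits a library-side timeout).
	let timeout_fut = tokio::time::timeout(Duration::from_secs(5), spawn_fut);

	match timeout_fut.await {
		Ok(Ok(Ok(res))) => Ok(res),
		// Library error, task panic, or timeout: treat all of them as a failed
		// attempt and let the next sync tick retry.
		_ => Err(()),
	}
}
```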
Isn't electrum different because it isn't an async library?
I think locking up worker threads can still happen because the async task may finish on the timeout, but the worker thread remains blocked? It isn't cancelled.
Okay. Now added a fixup also setting the timeout parameter on the `electrum_client` side (where we can).
Is there a point though in having an async timeout wrapper if the spawned thread would hang anyway?
I think so. For one, now that we set the timeouts, the spawned thread should hopefully return eventually. But if it wouldn't do so for some reason, it would be preferable to at least try to continue (while risking eventually running out of blocking worker threads if the issue persists), instead of just locking up and not continuing with subsequent sync attempts.
I see you also set retries now. I saw that it was 1 previously (the default). Did the earlier test with the wifi disconnect pass because that one retry happened after you already reconnected wifi?
If I understand this correctly, setting the timeout on the electrum client itself is essential to not end up with a potential graveyard of stuck threads? This might be important to add as a comment.
Also curious to hear why you think that retrying over and over again when a previous request is still stuck is a good idea? It will eventually exhaust the thread pool, right? And then stop independent processes within LDK too?
The original reason for the double timeout is because esplora-client turned out to be not trustworthy. But, as suggested above, isn't electrum different because it isn't an async library and it is unlikely for the timeout to fail? Looking at the sources, it seems that the timeout is set on a pretty low level for all calls.
> I see you also set retries now. I saw that it was 1 previously (the default). Did the earlier test with the wifi disconnect pass because that one retry happened after you already reconnected wifi?

No, I think the calls were cancelled and we then simply continued on the next timer tick.

> If I understand this correctly, setting the timeout on the electrum client itself is essential to not end up with a potential graveyard of stuck threads? This might be important to add as a comment.

I don't think it's super essential, as in most circumstances when the connection would fail we'd still error once the connection is dropped, IIUC. But yes, we probably don't want to lean too much on behavior that might also very well be OS-specific.

> Also curious to hear why you think that retrying over and over again when a previous request is still stuck is a good idea? It will eventually exhaust the thread pool, right? And then stop independent processes within LDK too?

Chain syncing and fee rate updates are critical for the operation of a Lightning node. When some process would really get stuck for an unexpected reason, we basically have two options: panic, hoping that a restart or operator intervention would fix the issue, or retry, risking that we might exhaust the thread pool eventually if the exact same unexpected behavior persists 512 times in a row. I think the latter is preferable.

> The original reason for the double timeout is because esplora-client turned out to be not trustworthy. But, as suggested above, isn't electrum different because it isn't an async library and it is unlikely for the timeout to fail? Looking at the sources, it seems that the timeout is set on a pretty low level for all calls.

Yeah, tbh. I mostly added it for the sake of uniformity across the different chain sources and as a general 'belt-and-suspenders' kind of thing. If we don't think 'belt-and-suspenders' are needed, we could consider dropping the timeouts everywhere and relying on the dependencies. I'm on the fence about going this way.

But, tbh., I currently don't fully see why being doubly cautious in this amalgamation of async runtimes and blocking threads is a bad thing, especially since the behavior of the dependency crates might very well (silently) change down the line, even if we review their code now and are positive that the double timeouts are not required in one case or another. FWIW, there is a reason why we added them in the first place, namely that users in production had their node lock up for a really long time due to completely unpredictable behavior arising from reusing `reqwest` clients across different thread contexts (reusing clients is explicitly endorsed in their docs, btw).
A few final non-blocking things about code re-use.
Force-pushed from 96ee279 to cb5dc5b.
Squashed fixups without further changes.
Nearly there. Just the one question about the double timeouts.
.. we need to bump the version, while making sure `electrsd` and CLN CI are using the same version. As the `blockstream/bitcoind` version hasn't reached v28 yet, we opt for v27.2 everywhere for now.
We upgrade our tests to use `electrum-client` v0.22 and `electrsd` v0.31 to ensure compatibility with the `bdk_electrum` version we're about to start using.
By default `rustls`, our TLS implementation of choice, uses `aws-lc-rs` which requires `bindgen` on some platforms. To allow building with `aws-lc-rs` on Android, we here install the `bindgen-cli` tool before running the bindings generation script in CI.
We here set up the basic API and structure for `ChainSource::Electrum`. Currently, we won't have a `Runtime` available when initializing `ChainSource::Electrum` in `Builder::build`. We therefore isolate any runtime-specific behavior into an `ElectrumRuntimeClient`. This might change in the future, but for now we do need this workaround.
Currently, we won't have a `Runtime` available when initializing `ChainSource::Electrum`. We therefore isolate any runtime-specific behavior into the `ElectrumRuntimeStatus`. Here, we implement `Filter` for `ElectrumRuntimeClient`, but we need to cache the registrations as they might happen prior to `ElectrumRuntimeClient` becoming available.
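A loose sketch of the caching idea described here; the struct and field names are illustrative rather than the PR's actual code, while the `Filter` trait and `WatchedOutput` come from the `lightning` crate:

```rust
use std::sync::{Arc, RwLock};

use bitcoin::{Script, ScriptBuf, Txid};
use lightning::chain::{Filter, WatchedOutput};

// Stub standing in for the runtime-backed client that only exists after start.
struct ElectrumRuntimeClient {}

impl ElectrumRuntimeClient {
	fn register_tx(&self, _txid: &Txid, _script_pubkey: &Script) { /* forward to tx-sync */ }
	fn register_output(&self, _output: WatchedOutput) { /* forward to tx-sync */ }
}

// Registrations arriving while no runtime client exists are buffered.
enum ElectrumRuntimeStatus {
	Started(Arc<ElectrumRuntimeClient>),
	Stopped {
		pending_registered_txs: Vec<(Txid, ScriptBuf)>,
		pending_registered_outputs: Vec<WatchedOutput>,
	},
}

struct ElectrumChainSource {
	status: RwLock<ElectrumRuntimeStatus>,
}

impl Filter for ElectrumChainSource {
	fn register_tx(&self, txid: &Txid, script_pubkey: &Script) {
		match &mut *self.status.write().unwrap() {
			ElectrumRuntimeStatus::Started(client) => client.register_tx(txid, script_pubkey),
			ElectrumRuntimeStatus::Stopped { pending_registered_txs, .. } => {
				// Cache the registration until the runtime client becomes available.
				pending_registered_txs.push((*txid, script_pubkey.to_owned()))
			},
		}
	}

	fn register_output(&self, output: WatchedOutput) {
		match &mut *self.status.write().unwrap() {
			ElectrumRuntimeStatus::Started(client) => client.register_output(output),
			ElectrumRuntimeStatus::Stopped { pending_registered_outputs, .. } => {
				pending_registered_outputs.push(output)
			},
		}
	}
}
```

On start, the buffered registrations would then be drained and replayed against the freshly created runtime client.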
.. as we do with `BitcoindRpc`, we now test `Electrum` support by running the `full_cycle` test in CI.
Force-pushed from cb5dc5b to db756c6.
Closes #196.
So far, we support Esplora and Bitcoind RPC chain sources. In this PR we add Electrum support based on the blocking `rust-electrum-client`, and its `bdk_electrum` and `lightning-transaction-sync` counterparts.

Due to the blocking nature of `rust-electrum-client` and as it directly connects to the server upon `Client::new` (see bitcoindevkit/rust-electrum-client#166), we ended up wrapping the runtime-specific behavior in an `ElectrumRuntimeClient` object that is initialized and dropped in `Node::start` and `stop`, respectively (see the sketch at the end of this description).

One thing still missing that we need to consider is how we'd reestablish connections to the remote after they have been lost for one reason or another. IMO, that behavior should live in `rust-electrum-client` to avoid all users having to duplicate it, so it's pending resolution of bitcoindevkit/rust-electrum-client#165.

As we did with `bitcoind`-RPC, Electrum support is tested by adding another `full_cycle` integration test.
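For illustration, a minimal sketch of the start/stop lifecycle described above; apart from `ElectrumRuntimeClient`, `Node::start`/`stop`, and `electrum_client::Client::new`, all names and details here are hypothetical:

```rust
use std::sync::{Arc, RwLock};

// Hypothetical wrapper around the blocking client plus the runtime handle used
// for `spawn_blocking` calls such as the `batch_call` example earlier.
struct ElectrumRuntimeClient {
	client: Arc<electrum_client::Client>,
	runtime: Arc<tokio::runtime::Runtime>,
}

struct ElectrumChainSource {
	server_url: String,
	runtime_client: RwLock<Option<Arc<ElectrumRuntimeClient>>>,
}

impl ElectrumChainSource {
	// Called from `Node::start` once a runtime is available: `Client::new`
	// connects to the server synchronously, so it shouldn't be called earlier
	// (or from an async context).
	fn start(&self, runtime: Arc<tokio::runtime::Runtime>) -> Result<(), electrum_client::Error> {
		let client = Arc::new(electrum_client::Client::new(&self.server_url)?);
		*self.runtime_client.write().unwrap() =
			Some(Arc::new(ElectrumRuntimeClient { client, runtime }));
		Ok(())
	}

	// Called from `Node::stop`: dropping the runtime client tears down the
	// connection again.
	fn stop(&self) {
		*self.runtime_client.write().unwrap() = None;
	}
}
```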